Pooling ASR data for closely related languages
نویسندگان
چکیده
We describe several experiments that were conducted to assess the viability of data pooling as a means to improve speech-recognition performance for under-resourced languages. Two groups of closely related languages from the Southern Bantu language family were studied, and our tests involved phoneme recognition on telephone speech using standard tied-triphone Hidden Markov Models. Approximately 6 to 11 hours of speech from around 170 speakers was available for training in each language. We find that useful improvements in recognition accuracy can be achieved when pooling data from languages that are highly similar, with two hours of data from a closely related language being approximately equivalent to one hour of data from the target language in the best case. However, the benefit decreases rapidly as languages become slightly more distant, and is also expected to decrease when larger corpora are available. Our results suggest that similarities in triphone frequencies are the most accurate predictor of the performance of language pooling in the conditions studied here.
منابع مشابه
Optimizing Data Selection for Automatic Speech Recognition in Low Resource Languages
Developing Automatic Speech Recognition (ASR) systems for low resource languages is a labor-, computation-, and timeintensive task. Data selection techniques seek highly informative subsets of speech data for transcription and can lead to considerable reduction in time and expense for transcription and ASR training. This project investigates unsupervised and supervised data selection techniques...
متن کاملTarget Structured Cross Language Model Refinement
The task of porting Automatic Speech Recognition (ASR) technology to many languages is hindered by a lack of transcribed acoustic data, which in turn prevents the development of accurate acoustic models necessary for the recognition task. To overcome this problem, recent research has sought to exploit the similarity of sounds across languages, and use this similarity to adapt models from one or...
متن کاملA smartphone-based ASR data collection tool for under-resourced languages
Acoustic data collection for automatic speech recognition (ASR) purposes is a particularly challenging task when working with underresourced languages, many of which are found in the developing world. We provide a brief overview of related data collection strategies, highlighting some of the salient issues pertaining to collecting ASR data for under-resourced languages. We then describe the dev...
متن کاملUsing resources from a closely-related language to develop ASR for a very under-resourced language: a case study for iban
This paper presents our strategies for developing an automatic speech recognition system for Iban, an under-resourced language. We faced several challenges such as no pronunciation dictionary and lack of training material for building acoustic models. To overcome these problems, we proposed approaches which exploit resources from a closely-related language (Malay). We developed a semi-supervise...
متن کاملComparison of acoustic modeling techniques for Vietnamese and Khmer ASR
This paper presents a comparison of some different acoustic modeling strategies for under-resourced languages. When only limited speech data are available for under-resourced languages, we propose some crosslingual acoustic modeling techniques. We apply and compare these techniques in Vietnamese ASR. Since there is no pronunciation dictionary for some underresourced languages, we investigate gr...
متن کامل